Packet Traffic Learning

Proposal

Project description: Our project aims to develop a predictive model to detect anomalous network behavior using packet-level and statistical features derived from network traffic. With machine learning models, we aim to accurately classify and predict network anomalies, which is essential for intrusion detection, network security monitoring, and incident response automation.
Author
Affiliation

The Anomalists - Joey Garcia, David Kyle

College of Information Science, University of Arizona

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats # for analysis plan

Dataset

network_traffic_train = pd.read_csv('data/KDDTrain+.txt')
network_traffic_test = pd.read_csv('data/KDDTest+.txt')

df_train = network_traffic_train.copy()
df_test = network_traffic_test.copy()

'''
Columns recieved from kaggle project 
https://www.kaggle.com/code/faizankhandeshmukh/intrusion-detection-system

'''

# Define the list of column names based on the NSL-KDD dataset description
columns = [
    'duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes',
    'land', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins',
    'logged_in', 'num_compromised', 'root_shell', 'su_attempted', 'num_root',
    'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds',
    'is_host_login', 'is_guest_login', 'count', 'srv_count', 'serror_rate',
    'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate',
    'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
    'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
    'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
    'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate',
    'dst_host_srv_rerror_rate', 'attack', 'level'
]

# Assign the column names to the dataframe
df_train.columns = columns
df_test.columns = columns


print('Shapes (train, test):', df_train.shape, df_test.shape)
Shapes (train, test): (125972, 43) (22543, 43)

We are using a training and testing dataset of network intrusion detection from NSL-KDD from Kaggle. The intrusion detection network traffic training dataset contains 125,972 rows and 43 columns, and 22,543 rows and 43 columns in the test dataset.

The attack field indicates normal or anomalous (multi-class) observations which allows us to use learning approaches for classifying anomalous network activity. A new binary classification feature, is_anomalous, will be added to indicate if the network connection was anomalous or not. This will be the target field for the project.

We chose this dataset because it provides a rich and realistic representation of network traffic data. The presence of labeled data allows us to train and evaluate supervised models; the diversity and volume of traffic patterns make it well-suited for exploring unsupervised anomaly detection techniques as well. This balance between complexity and feature richness aligns well with our research questions and modeling goals.

Questions

Q1. Using supervised machine learning models such as Long Short-Term Memory (LSTM) and Support Vector Machines (SVMs), can we accurately calssify network traffic as normal and anomalous based on labeled data? How do their performances compare in terms of accuracy, precision, recall, and F1-score?

Q2. Can unsupervised learning methods such as K-Means Clustering and Density-Based Clustering (DBSCAN) detect anomalous patterns in network traffic without using labeled data?

Summary. How do the supervised and unsupervised approaches compare?

Dataset Analysis

Variables

Column Name Data Type Description
duration int64 Length (in seconds) of the connection.
protocol_type object Protocol used (e.g., tcp, udp, icmp).
service object Network service on the destination (e.g., http, telnet).
flag object Status flag of the connection.
src_bytes int64 Number of data bytes sent from source to destination.
dst_bytes int64 Number of data bytes sent from destination to source.
land int64 1 if connection is from/to the same host/port; 0 otherwise.
wrong_fragment int64 Number of wrong fragments.
urgent int64 Number of urgent packets.
hot int64 Number of “hot” indicators.
num_failed_logins int64 Number of failed login attempts.
logged_in int64 1 if successfully logged in; 0 otherwise.
num_compromised int64 Number of compromised conditions.
root_shell int64 1 if root shell is obtained; 0 otherwise.
su_attempted int64 1 if “su root” command attempted; 0 otherwise.
num_root int64 Number of “root” accesses.
num_file_creations int64 Number of file creation operations.
num_shells int64 Number of shell prompts invoked.
num_access_files int64 Number of accesses to control files.
num_outbound_cmds int64 Number of outbound commands (always 0 in KDD99).
is_host_login int64 1 if login is to a host account; 0 otherwise.
is_guest_login int64 1 if login is to a guest account; 0 otherwise.
count int64 Number of connections to the same host in the past 2 seconds.
srv_count int64 Number of connections to the same service in the past 2 seconds.
serror_rate float64 % of connections with SYN errors.
srv_serror_rate float64 % of connections to the same service with SYN errors.
rerror_rate float64 % of connections with REJ errors.
srv_rerror_rate float64 % of connections to the same service with REJ errors.
same_srv_rate float64 % of connections to the same service.
diff_srv_rate float64 % of connections to different services.
srv_diff_host_rate float64 % of connections to different hosts on the same service.
dst_host_count int64 Number of connections to the destination host.
dst_host_srv_count int64 Number of connections to the destination host and service.
dst_host_same_srv_rate float64 % of connections to the same service on the destination host.
dst_host_diff_srv_rate float64 % of connections to different services on the destination host.
dst_host_same_src_port_rate float64 % of connections from the same source port.
dst_host_srv_diff_host_rate float64 % of connections to the same service from different hosts.
dst_host_serror_rate float64 % of connections with SYN errors to the destination host.
dst_host_srv_serror_rate float64 % of connections with SYN errors to the destination service.
dst_host_rerror_rate float64 % of connections with REJ errors to the destination host.
dst_host_srv_rerror_rate float64 % of connections with REJ errors to the destination service.
attack object Label indicating the type of attack or “normal”.
level int64 Severity or confidence score of the attack (if available).

Exploratory Data Analysis

We take a quick look at the training data to see if there are any obvious imbalances.

print("Shape:", df_train.shape)
print("Missing values:", df_train.isna().sum().sum())
print("Duplicates:", df_train.duplicated().sum())
print("Unique attack labels:", df_train['attack'].nunique())
print("Attack label distribution:\n", df_train['attack'].value_counts().head(5))

# Show types and non-null counts
df_train.info(verbose=False)
Shape: (125972, 43)
Missing values: 0
Duplicates: 0
Unique attack labels: 23
Attack label distribution:
 attack
normal       67342
neptune      41214
satan         3633
ipsweep       3599
portsweep     2931
Name: count, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125972 entries, 0 to 125971
Columns: 43 entries, duration to level
dtypes: float64(15), int64(24), object(4)
memory usage: 41.3+ MB

The brief look at the data is positive. There are plenty of data points and features, depending on time speeds for fitting models, we may decrease our sample size because hyper parameter using GridSearchCV training may be time intensive. In the data there are no missing values, and the glimpse of the attack column provides insight into why we want to collapse it into a binary column.

Normal vs Anomalous Traffic

First, look at the amount of normal v. anomalous data.

attack
normal             67342
neptune            41214
satan               3633
ipsweep             3599
portsweep           2931
smurf               2646
nmap                1493
back                 956
teardrop             892
warezclient          890
pod                  201
guess_passwd          53
buffer_overflow       30
warezmaster           20
land                  18
imap                  11
rootkit               10
loadmodule             9
ftp_write              8
multihop               7
phf                    4
perl                   3
spy                    2
Name: count, dtype: int64

The plot provides an idea of the specific attack types expressesd in the data. The plot communicates why it makes sense to group all non-normal traffic together.

We feature engineer a new column, is_anomalous, this contains 0 if the connection is normal and 1 if the connection is not normal.

# create binary target column: 1 = attack, 0 = normal

df_train['is_anomalous'] = df_train['attack'].apply(
  lambda x: 0 if x == 'normal' else 1)

Examine the new column, is_anaomalous, to get an idea of the target frequency.

Count Percentage
is_anomalous
Normal 67342 53.46
Attack 58630 46.54

The is_anomalous classification target shows a near-even class distribution indicating the the dataset is well balanced. There should be no need for resampling or class weighting to correct the set. It appears this dataset will be a good candidate for learning models.

Analysis plan

Problem Introduction

The project is to build and evaluate models capable of detecting anomalous network traffic based on connection-level features from the NSL-KDD dataset. The problem is framed as a binary classification task, where each record is labeled as either normal or anomalous. This has real-world applications in intrusion detection systems and network security monitoring.

The project will explore both supervised and unsupervised machine learning techniques to assess their effectiveness in identifying attacks from structured network traffic data.

Feature Engineering Strategy

To ensure a fair and consistent comparison, we will apply the same feature engineering pipeline to both supervised and unsupervised models. All features will be assigned appropriate column names based on the NSL-KDD documentation. Categorical variables such as protocol_type, service, and flag will be one-hot encoded, and low-variance or non-informative columns will be removed. Numeric features will be standardized using a scaler to normalize their ranges.

For supervised models, these engineered features will be used alongside the binary target is_anomalous. For unsupervised models, the same processed features will be used without labels, allowing the models to explore underlying structure or detect anomalous patterns. This consistent preprocessing ensures that differences in performance can be attributed to the modeling approaches rather than inconsistencies in data preparation.

Dimensionality Reduction

Our dataset currently has over 40 features, we will apply a combination of feature reduction techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or even use the Random Forest Classifier feature importance attribute. We’ll experiment to determine the most optimal feature subset for our classification task.

Q1. Supervised Learning

For the supervised learning portion of the project, we will train and evaluate models using the labeled NSL-KDD dataset to classify network traffic as normal or anomalous. Specifically, we will implement and compare a Long Short-Term Memory (LSTM) neural network and a Support Vector Machine (SVM). These models will be trained on the same feature-engineered data, using the is_anomalous column as the target. Model performance will be assessed using standard classification metrics, including accuracy, precision, recall, F1-score, and ROC AUC.

Q2. Unsupervised Learning

For the unsupervised learning portion of the project, we will explore clustering-based approaches to detect anomalies in network traffic without relying on labeled data. We plan to experiment with techniques such as K-Means and DBSCAN to group similar observations and identify outliers that may correspond to attacks. After clustering, we will evaluate how well the resulting groupings align with the true labels using appropriate metrics for unsupervised learning. This will help us assess the potential of unsupervised models to detect anomalous behavior in the absence of supervision.

Summary Comparison

To compare the supervised and unsupervised approaches, we will evaluate their ability to correctly identify anomalous traffic using relevant metrics for each method. We will also consider practical factors such as interpretability, scalability, and the need for labeled data.

Project Timeline

Task Name Status Due Priority Summary
Dataset exploration In Progress Week 1 High Load the dataset, inspect features, handle any preprocessing needs.
Define research questions Complete Week 1 High Clarify goals for supervised and unsupervised anomaly detection.
Supervised model development Not Started Week 2 High Train models like Random Forest, Logistic Regression, and XGBoost.
Evaluation of supervised models Not Started Week 3 High Use accuracy, precision, recall, F₁, and ROC-AUC to assess performance.
Unsupervised model development Not Started Week 3 Medium Explore methods like Isolation Forest and clustering.
Evaluation of unsupervised models Not Started Week 4 Medium Compare anomaly scores to labeled data using precision-recall metrics.
Comparative analysis Not Started Week 4 High Analyze strengths and weaknesses of both approaches.
Final report & presentation Not Started Week 5 High Compile results, figures, and discussion into final deliverables.